Abstract

Non-symmetric rectangular correlation matrices occur in many problems in
economics. We test the method of extracting statistically meaningful correlations
between input and output variables of large dimensionality and build a
toy model for artificially included correlations in large random time series. The
results are then applied to the analysis of Polish macroeconomic data and can be
used as an alternative to the classical cointegration approach.

1 Multivariate modeling of time series - setting the stage

Multivariate time series data are widely available in different fields such as economics, finance,
medicine and telecommunications. Building efficient multivariate models, which help in
understanding the relation between a large number of possible causes and resulting effects, is
therefore crucial for many decision-making activities.

Due to the works of Granger [1], Bollerslev [2] and Sims [3], Vector Autoregressive (VAR)
and Vector GARCH (e.g. BEKK, VEC) models are nowadays intensively investigated, especially
in the field of econometrics. It is believed that the system itself should determine the
number of relevant input and output factors. The "brute force" method involves taking
all the possible input and output factors and systematically correlating them, hoping to find
some signal. One can easily convince oneself that VAR and Vector GARCH models work
well for a small number of input and output variables, but suffer from the so-called
"dimensionality curse", i.e. they blow up with just a few factors. The cross-equation correlation
matrix contains all the information about contemporaneous correlation in a Vector model
and may be its greatest strength and its greatest asset. Since no questionable a priori
assumptions are imposed, fitting a Vector model allows the dataset to speak for itself. Still,
without imposing any restrictions on the structure of the correlation matrix one cannot
make a causal interpretation of the results. We believe there exist highly non-trivial,
statistically meaningful correlations between two samples of non-equal size (i.e. input and
output variables of large dimensionality), which can then be treated as "natural"
restrictions on the structure of the correlation matrix. However, since the data inside the
samples can also be correlated, one has to remove in-the-sample correlations first and then
look for a signal (if any) outside the samples.

2 Model description 

A detailed description of the ideas that drive our toy model can be found in [4]. The
authors suggested comparing the singular value spectrum of the empirical rectangular
M × N correlation matrix with a benchmark obtained using Random Matrix Theory results
(cf. [5]), assuming there are no correlations between the variables.

2.1 Notation and mathematical aspects 

Consider N input factors $X_a$, $a = 1, \dots, N$, and M output factors $Y_\alpha$, $\alpha = 1, \dots, M$, with
the total number of observations being T, so that X is an N × T matrix and Y is an M × T matrix.
All time series are standardized to have zero mean and unit variance. The data can be
completely different or be the same variables but observed at different times.
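The standardization step can be sketched in Python with NumPy as follows. This is only an illustration under our own assumptions: the function name `standardize`, the row-per-series layout and the random test data are hypothetical and are not taken from [4].

```python
import numpy as np

def standardize(series):
    """Rescale each row (one time series) to zero mean and unit variance."""
    series = np.asarray(series, dtype=float)
    mean = series.mean(axis=1, keepdims=True)
    std = series.std(axis=1, keepdims=True)  # population standard deviation (ddof=0)
    return (series - mean) / std

# Hypothetical example: N = 5 series with T = 200 observations each.
rng = np.random.default_rng(0)
raw = rng.normal(loc=3.0, scale=2.0, size=(5, 200))
X = standardize(raw)
```

After this step every row of `X` has zero sample mean and unit sample variance, which is the form assumed in the equations below.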

To remove the correlations inside each sample we form two correlation matrices, which
contain information about in-the-sample correlations:

$$ C_X = \frac{1}{T} X X^{\top}, \qquad C_Y = \frac{1}{T} Y Y^{\top} \tag{1} $$

The matrices are then diagonalized and the empirical spectrum is compared to the
theoretical Marčenko-Pastur spectrum [6], [7], [8]. This makes it possible to identify and extract
the statistically significant factors. The eigenvalues which lie well below the lower edge of the
Marčenko-Pastur spectrum represent the redundant factors, rejected by the system. They can be
excluded from further analysis, which slightly reduces the dimensionality of the problem
(i.e. one gets rid of spurious correlations). However, before doing that, one has to create a
set of uncorrelated unit variance input variables $\hat{X}$ and output variables $\hat{Y}$:

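$$ \hat{X}_{at} = \frac{1}{\sqrt{T \lambda_a}} \left( V^{\top} X \right)_{at}, \qquad \hat{Y}_{\alpha t} = \frac{1}{\sqrt{T \lambda_\alpha}} \left( U^{\top} Y \right)_{\alpha t} \tag{2} $$

where $V$, $U$ and $\lambda_a$, $\lambda_\alpha$ are the corresponding eigenvectors and eigenvalues of $C_X$, $C_Y$
respectively.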
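The diagonalization and rescaling described above can be sketched as follows. This is a minimal illustration rather than the code behind the reported results: the names `marchenko_pastur_edges` and `whiten`, the random test data and the $1/T$ normalization of the correlation matrix are our assumptions. The edges $(1 \pm \sqrt{N/T})^2$ are the standard Marčenko-Pastur result for uncorrelated unit-variance series.

```python
import numpy as np

def marchenko_pastur_edges(n_series, n_obs):
    """Edges (1 -/+ sqrt(q))^2 of the Marchenko-Pastur spectrum, with q = n_series / n_obs."""
    q = n_series / n_obs
    return (1.0 - np.sqrt(q)) ** 2, (1.0 + np.sqrt(q)) ** 2

def whiten(data):
    """Rotate and rescale standardized rows so that the resulting rows are orthonormal."""
    n_series, n_obs = data.shape
    corr = data @ data.T / n_obs                        # in-the-sample correlation matrix
    eigenvalues, eigenvectors = np.linalg.eigh(corr)    # ascending eigenvalues
    scale = np.sqrt(n_obs * eigenvalues)[:, None]       # assumes all eigenvalues are > 0
    factors = (eigenvectors.T @ data) / scale
    return factors, eigenvalues

# Hypothetical uncorrelated input sample: N = 20 series, T = 400 observations.
rng = np.random.default_rng(0)
N, T = 20, 400
X = rng.standard_normal((N, T))
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

X_hat, eigenvalues = whiten(X)
lower, upper = marchenko_pastur_edges(N, T)
inside = (eigenvalues > lower) & (eigenvalues < upper)  # eigenvalues compatible with pure noise
```

With this normalization `X_hat @ X_hat.T` is the identity matrix, so any structure left in the cross-correlations cannot come from correlations inside the sample.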

Now we are ready to create the M × N cross-correlation matrix G between $\hat{Y}$ and $\hat{X}$,

$$ G = \hat{Y} \hat{X}^{\top} \tag{3} $$

which includes only the correlations between input and output factors. The singular value
decomposition (SVD) (cf. [9]) is used to find the empirical spectrum of singular values. The
singular value spectrum represents the strength of cross-correlations between input and
output factors.
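A minimal sketch of this step is given below, assuming the in-the-sample correlations are removed with a $1/\sqrt{T\lambda}$ rescaling so that the singular values fall between 0 and 1. The function names, the sample sizes and the planted link between one output and one input are hypothetical and serve only as an illustration.

```python
import numpy as np

def whiten(data):
    """Standardize each row, then rotate and rescale so that the rows are orthonormal."""
    data = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
    n_obs = data.shape[1]
    eigenvalues, eigenvectors = np.linalg.eigh(data @ data.T / n_obs)
    return (eigenvectors.T @ data) / np.sqrt(n_obs * eigenvalues)[:, None]

def cross_singular_values(inputs, outputs):
    """Singular values of the cross-correlation matrix between whitened outputs and inputs."""
    x_hat = whiten(inputs)     # N x T
    y_hat = whiten(outputs)    # M x T
    g = y_hat @ x_hat.T        # M x N cross-correlation matrix
    return np.linalg.svd(g, compute_uv=False)  # returned in descending order

# Hypothetical data: pure noise, except that output 0 follows input 3.
rng = np.random.default_rng(0)
N, M, T = 20, 5, 400
X = rng.standard_normal((N, T))
Y = rng.standard_normal((M, T))
Y[0] = X[3] + 0.5 * rng.standard_normal(T)

s = cross_singular_values(X, Y)
```

In this toy setting the largest entry of `s` reflects the planted link, while the remaining singular values stay at the level expected from noise.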

2.2 Singular values and Free Random Matrix Theory 

Theoretical predictions for eigenvalue density are obtained using the Free Random Matrix 
Theory and assuming no correlations between the samples. The final result for singular 
eigenvalue density, when there are no correlations between input and output data is: 



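$$ \rho(s) = \max\left(1 - \frac{N}{T},\, 1 - \frac{M}{T}\right) \delta(s) + \max\left(\frac{N}{T} + \frac{M}{T} - 1,\, 0\right) \delta(s - 1) + \mathrm{Re}\, \frac{\sqrt{(s^2 - \gamma_-)(\gamma_+ - s^2)}}{\pi s \left(1 - s^2\right)} \tag{4} $$

with s being the singular eigenvalues and

$$ \gamma_\pm = \frac{1}{T^2} \left( NT + MT - 2MN \pm 2\sqrt{MN (T - N)(T - M)} \right) $$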

Empirical results are then compared with the above benchmark. Any deviations from it may
indicate nontrivial correlations between the samples.
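The benchmark can be evaluated numerically along the following lines. The sketch assumes the continuous part of the density has the form $\sqrt{(s^2 - \gamma_-)(\gamma_+ - s^2)} / (\pi s (1 - s^2))$ with $\gamma_\pm$ as defined above; the function names and the sample sizes are hypothetical.

```python
import numpy as np

def benchmark_edges(n_in, n_out, n_obs):
    """Return (gamma_minus, gamma_plus), the edges of the benchmark in terms of s^2."""
    n, m = n_in / n_obs, n_out / n_obs
    centre = n + m - 2.0 * m * n
    half_width = 2.0 * np.sqrt(m * n * (1.0 - n) * (1.0 - m))
    return centre - half_width, centre + half_width

def benchmark_density(s, n_in, n_out, n_obs):
    """Continuous part of the benchmark density, valid for 0 < s < 1."""
    gamma_minus, gamma_plus = benchmark_edges(n_in, n_out, n_obs)
    s = np.asarray(s, dtype=float)
    radicand = (s ** 2 - gamma_minus) * (gamma_plus - s ** 2)
    # Taking the real part of the square root means the density vanishes outside the band.
    return np.sqrt(np.clip(radicand, 0.0, None)) / (np.pi * s * (1.0 - s ** 2))

# Hypothetical sizes: N = 20 inputs, M = 5 outputs, T = 400 observations.
N, M, T = 20, 5, 400
gamma_minus, gamma_plus = benchmark_edges(N, M, T)
grid = np.linspace(np.sqrt(gamma_minus), np.sqrt(gamma_plus), 20001)
density = benchmark_density(grid, N, M, T)

# Trapezoidal rule: weight carried by the continuous part of the spectrum.
mass = np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(grid))
```

For $N + M < T$ the continuous part carries the weight $\min(N, M)/T$, the remainder being the delta peak at $s = 0$, which gives a quick consistency check of the implementation.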
3 Applications



Two different data sets were investigated: Polish macroeconomic data and an artificially
generated data set, where temporal cross-correlations are introduced by construction.

3.1 Polish Macroeconomic data 

The analysis began with checking whether the method described in [1] is relevant for
describing the relation between the Polish inflation indexes and other Polish macroeconomic
data published by different government and non-government agencies. We have used monthly
changes of M = 13 different CPI indicators as our predicted variables (i.e. output sample Y)
and monthly changes of N = 48 economic indicators (e.g. sectoral employment, foreign exchange
reserves, PPIs) as explanatory variables (i.e. input sample X). The investigated period was
01.1999-08.2007 (i.e. T = 104). The data were standardized, but the factors in the input and
output samples were not carefully preselected, so that the data could speak for themselves
and the system could select the optimal combination of variables. The next step involved
cleaning internal correlations in each sample. To do so, we used equation (1). The resulting
matrices were then diagonalized and two sets of internally uncorrelated data were prepared.
From the uncorrelated data we create the rectangular matrix G and apply the SVD to it to
calculate the singular eigenvalues. Finally we have used the benchmark calculated in
equation (4) to compare the data with the predicted eigenvalue density. The results show
that there exist some singular eigenvalues which do not fit the benchmark. Among them, the
highest singular eigenvalue $s_1 = 2.5$ and the corresponding singular eigenvector represent
the standard correlation between expenses for electricity and producer prices in the energy
sector. There are, however, other non-trivial relations, e.g. between the CPI in the
telecommunication sector and foreign exchange reserves.

Figure 1: Correlation matrices representing in-the-sample correlations. Panels: 48 macroeconomic indicators (input X); 13 CPIs (output Y).
Figure 2: Out of the sample correlations. Panels: matrix G (out of the sample correlations); singular eigenvalues of G.
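The comparison with the benchmark can be illustrated for the sample sizes quoted above (N = 48, M = 13, T = 104). The sketch below uses random stand-in series with one planted link, not the Polish data, and assumes a $1/\sqrt{T\lambda}$ whitening under which all singular values lie between 0 and 1. It makes no claim about the normalization behind the values reported in this section, and all names are hypothetical.

```python
import numpy as np

def whiten(data):
    """Standardize each row, then rotate and rescale so that the rows are orthonormal."""
    data = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
    n_obs = data.shape[1]
    eigenvalues, eigenvectors = np.linalg.eigh(data @ data.T / n_obs)
    return (eigenvectors.T @ data) / np.sqrt(n_obs * eigenvalues)[:, None]

def benchmark_edges(n_in, n_out, n_obs):
    """Return (gamma_minus, gamma_plus), the edges of the benchmark in terms of s^2."""
    n, m = n_in / n_obs, n_out / n_obs
    centre = n + m - 2.0 * m * n
    half_width = 2.0 * np.sqrt(m * n * (1.0 - n) * (1.0 - m))
    return centre - half_width, centre + half_width

N, M, T = 48, 13, 104                        # sizes quoted in the text
rng = np.random.default_rng(0)
X = rng.standard_normal((N, T))              # stand-in inputs, NOT the macroeconomic data
Y = rng.standard_normal((M, T))              # stand-in outputs
Y[0] = X[0] + 0.1 * rng.standard_normal(T)   # planted input-output link

s = np.linalg.svd(whiten(Y) @ whiten(X).T, compute_uv=False)
gamma_minus, gamma_plus = benchmark_edges(N, M, T)
upper_edge = np.sqrt(gamma_plus)
outliers = s[s > upper_edge]                 # candidates for genuine cross-correlations
```

Under these assumptions the edges evaluate to roughly $\gamma_- \approx 0.141$ and $\gamma_+ \approx 0.801$, so singular values above about 0.89 would be the ones to inspect more closely.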



3.2 Artificially generated data - multivariate GARCH(1,1) process

We also wanted to check whether the above method is able to extract temporal correlations
from data that remember their past realizations and, though not necessarily, their past
variances. To this end, a sample of 100 paths of GARCH(1,1) type with 1000 observations
each was generated. The steps presented in the previous section were repeated. The
input data were the 100 GARCH(1,1) paths lagged by one observation, and the output data
were the original set of variables. As a result we obtained one eigenvalue which
does not fit the assumed benchmark well and is suspected to represent the memory of the
data. However, further tests to confirm this idea are still necessary.
Figure 3: Singular eigenvalues from the GARCH process compared to the benchmark
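The construction of the lagged input and the original output samples can be sketched as follows. The text does not state the GARCH parameters or the precise multivariate specification, so the sketch assumes independent univariate GARCH(1,1) paths with Gaussian shocks; the parameter values `omega`, `alpha`, `beta` and the function name are purely illustrative.

```python
import numpy as np

def simulate_garch11(n_paths, n_obs, omega=0.05, alpha=0.1, beta=0.85, seed=0):
    """Simulate independent GARCH(1,1) paths (illustrative parameters, alpha + beta < 1)."""
    rng = np.random.default_rng(seed)
    shocks = rng.standard_normal((n_paths, n_obs))
    returns = np.empty((n_paths, n_obs))
    # Start every path at the unconditional variance omega / (1 - alpha - beta).
    variance = np.full(n_paths, omega / (1.0 - alpha - beta))
    for t in range(n_obs):
        returns[:, t] = np.sqrt(variance) * shocks[:, t]
        variance = omega + alpha * returns[:, t] ** 2 + beta * variance
    return returns

paths = simulate_garch11(n_paths=100, n_obs=1000)
inputs = paths[:, :-1]    # the paths lagged by one observation
outputs = paths[:, 1:]    # the original paths, aligned with the lagged inputs
```

Aligning the two samples leaves 999 common observations, so the benchmark would be evaluated with N = M = 100 and T = 999 in this setting.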
4 Conclusions and Future Work



Both examples show that there exists a non-trivial correlation structure between input and
output variables. Though redundant factors add a significant amount of noise to the problem,
the SVD makes it possible to identify only the truly informative factors. This might be helpful
in analyzing the effect of so-called sunspot or spurious correlations and in the investigation of
correlations between different stock exchanges, and will be part of our future work.